All hits all the time: parameter-free calculation of spaced seed sensitivity

Authors

  • Denise Y. F. Mak
  • Gary Benson
Abstract

MOTIVATION: Standard search techniques for DNA repeats start by identifying small matching words, or seeds, that may inhabit larger repeats. Recent innovations in seed structure include spaced seeds and indel seeds, which are more sensitive than contiguous seeds. Evaluating seed sensitivity requires (i) specifying a homology model for alignments and (ii) assigning probabilities to those alignments. Optimal seed selection is resource intensive because all alternative seeds must be tested. Current methods require that the model and its probability parameters be specified in advance. When the parameters change, the entire calculation has to be rerun.

RESULTS: We show how to eliminate the need for prior parameter specification by exploiting a simple observation: given a homology model, the alignments hit by a particular seed remain the same regardless of the probability parameters. Only the weights assigned to those alignments change. Therefore, if we know all the hits, we can easily (and quickly) find optimal seeds. We describe an efficient preprocessing step, which is computed once per seed. Then we show several increasingly efficient methods to find the optimal seed when given specific probability parameters. Indeed, we show how to determine exactly which seeds can never be optimal under any set of probability parameters. This leads to the startling observation that out of thousands of seeds, only a handful have any chance of being optimal. We then show how to identify optimal seeds and the boundaries within probability space where they are optimal.
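The central observation, that a seed's hit set is fixed and only the alignment weights depend on the probability parameters, can be illustrated with a small brute-force sketch. This is not the authors' algorithm: it assumes a simple Bernoulli homology model (each alignment position is a match with probability `p`, independently), enumerates every alignment of a short length `n`, and uses hypothetical names (`seed_hits`, `hit_counts`, `sensitivity`) introduced only for illustration. Seeds are written with `1` for a required match and `*` for a don't-care position.

```python
from itertools import product


def seed_hits(seed, alignment):
    """Return True if the seed hits the alignment at some offset.

    A hit means every '1' position of the seed lies over a match (1)
    in the alignment; '*' positions are ignored.
    """
    span = len(seed)
    return any(
        all(alignment[i + j] == 1 for j, c in enumerate(seed) if c == "1")
        for i in range(len(alignment) - span + 1)
    )


def hit_counts(seed, n):
    """Parameter-free preprocessing, done once per seed.

    counts[k] is the number of length-n alignments with exactly k
    matches that the seed hits. No probability is used here.
    (Brute force over 2**n alignments: an assumption for clarity,
    only practical for small n.)
    """
    counts = [0] * (n + 1)
    for alignment in product((0, 1), repeat=n):
        if seed_hits(seed, alignment):
            counts[sum(alignment)] += 1
    return counts


def sensitivity(counts, p):
    """Weight the stored hit counts for a specific match probability p.

    Under the assumed Bernoulli model, an alignment with k matches has
    probability p**k * (1 - p)**(n - k).
    """
    n = len(counts) - 1
    return sum(c * p**k * (1 - p) ** (n - k) for k, c in enumerate(counts))


if __name__ == "__main__":
    n = 12
    seeds = ["1111", "11*11", "1*1*11"]
    # Enumerate hits once per seed ...
    tables = {s: hit_counts(s, n) for s in seeds}
    # ... then re-evaluate cheaply for any number of parameter values.
    for p in (0.6, 0.7, 0.8):
        best = max(seeds, key=lambda s: sensitivity(tables[s], p))
        print(p, best)
```

Changing `p` only re-runs `sensitivity`, which is a short sum over the stored counts; the expensive enumeration in `hit_counts` is never repeated.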

Related articles

All Hits All The Time: Parameter Free Calculation of Seed Sensitivity

Standard search techniques for DNA repeats start by identifying seeds, that is, small matching words, that may inhabit larger repeats. Recent innovations in seed structure have led to the development of spaced seeds [8] and indel seeds [9] which are more sensitive than contiguous seeds (also known as k-mers, k-tuples, l-words, etc.). Evaluating seed sensitivity requires 1) specifying a homology...

On spaced seeds for similarity search

Genomics studies routinely depend on similarity searches based on the strategy of finding short seed matches (contiguous k bases) which are then extended. The particular choice of the seed length, k, is determined by the tradeoff between search speed (larger k reduces chance hits) and sensitivity (smaller k finds weaker similarities). A novel idea of using a single deterministic optimized space...

Computing Alignment Seed Sensitivity with Probabilistic Arithmetic Automata

Heuristic sequence alignment and database search algorithms, such as PatternHunter and BLAST, are based on the initial discovery of so-called alignment seeds of well-conserved alignment patterns, which are subsequently extended to full local alignments. In recent years, the theory of classical seeds (matching contiguous q-grams) has been extended to spaced seeds, which allow mismatches within a...

On the complexity of the spaced seeds

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. These seeds are serving many homology queries daily. However the computational complexity of finding the optimal spaced seeds remains to be open. In this paper, we prove that computing hit probability of a spaced seed in a uniform homology reg...

PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds

MOTIVATION The explosion of next-generation sequencing data has spawned the design of new algorithms and software tools to provide efficient mapping for different read lengths and sequencing technologies. In particular, ABI's sequencer (SOLiD system) poses a big computational challenge with its capacity to produce very large amounts of data, and its unique strategy of encoding sequence data int...

Journal:
  • Bioinformatics

Volume 25, Issue 3

Pages: -

Publication date: 2009